In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

To visualize the workings of machine learning algorithms, it is often helpful to study one- or two-dimensional data, that is, data with only one or two features. While datasets in practice usually have many more features, it is hard to plot high-dimensional data on a two-dimensional screen, so these simple examples make the ideas easier to see.

We will illustrate some very simple examples before we move on to more "real world" data sets.

Classification

First, we will look at a two-class classification problem in two dimensions. We use synthetic data generated by the make_blobs function.


In [ ]:
from sklearn.datasets import make_blobs
X, y = make_blobs(centers=2, random_state=0)
print(X.shape)
print(y.shape)
print(X[:5, :])
print(y[:5])

As the data is two-dimensional, we can plot each sample as a point in two-dimensional space, with the first feature on the x-axis and the second feature on the y-axis.


In [ ]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=40)
plt.xlabel("first feature")
plt.ylabel("second feature")

As classification is a supervised task, and we are interested in how well the model generalizes, we split our data into a training set, to build the model from, and a test set, to evaluate how well the model performs on new data. The train_test_split function from the model_selection module does this for us, by randomly splitting off 25% of the data for testing.


In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
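
The default split already holds out 25% of the data for testing. If you want to make the fraction explicit, or change it, you can pass test_size yourself. A minimal sketch (the explicit argument here is only for illustration, not part of the original example):


In [ ]:
# explicit 25% hold-out; equivalent to the default split above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)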

The scikit-learn estimator API

Every algorithm is exposed in scikit-learn via an "Estimator" object. For instance, logistic regression is:


In [ ]:
from sklearn.linear_model import LogisticRegression

All models in scikit-learn have a very consistent interface. First, we instantiate the estimator object.


In [ ]:
classifier = LogisticRegression()

To build the model from our data, that is, to learn how to classify new points, we call the fit method with the training data and the corresponding training labels (the desired output for each training data point):


In [ ]:
classifier.fit(X_train, y_train)

We can then apply the model to unseen data and use the model to predict the estimated outcome using the predict method:


In [ ]:
prediction = classifier.predict(X_test)

We can compare these against the true labels:


In [ ]:
print(prediction)
print(y_test)

We can evaluate our classifier quantitatively by measuring what fraction of predictions is correct. This is called accuracy:


In [ ]:
np.mean(prediction == y_test)

There is also a convenience method, score, that all scikit-learn classifiers have to compute this directly from the test data:


In [ ]:
classifier.score(X_test, y_test)

It is often helpful to compare the generalization performance (on the test set) to the performance on the training set; a large gap between the two is a sign of overfitting:


In [ ]:
classifier.score(X_train, y_train)

LogisticRegression is a so-called linear model, which means it creates a decision boundary that is linear in the input space. In two dimensions, this simply means it finds a line to separate the blue points from the red:


In [ ]:
from figures import plot_2d_separator

plt.scatter(X[:, 0], X[:, 1], c=y, s=40)
plt.xlabel("first feature")
plt.ylabel("second feature")
plot_2d_separator(classifier, X)

Estimated parameters: All the estimated parameters are attributes of the estimator object ending with an underscore. Here, these are the coefficients and the offset (intercept) of the line:


In [ ]:
print(classifier.coef_)
print(classifier.intercept_)
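
To see how these numbers define the separating line, we can rearrange the boundary equation coef_[0, 0] * x0 + coef_[0, 1] * x1 + intercept_ = 0 for the second feature and plot it by hand. This is a rough sketch for illustration, not part of the original notebook:


In [ ]:
# decision boundary: w[0]*x0 + w[1]*x1 + b = 0  =>  x1 = -(w[0]*x0 + b) / w[1]
w = classifier.coef_[0]
b = classifier.intercept_[0]
x0 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x1 = -(w[0] * x0 + b) / w[1]
plt.scatter(X[:, 0], X[:, 1], c=y, s=40)
plt.plot(x0, x1, 'k--')
plt.xlabel("first feature")
plt.ylabel("second feature")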

Another classifier: K Nearest Neighbors

Another popular and easy-to-understand classifier is K nearest neighbors (kNN). It has one of the simplest learning strategies: given a new, unknown observation, look up the samples in your reference database (the training set) that have the closest features and assign the predominant class.
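
To make that strategy concrete, here is a rough numpy sketch of what a one-nearest-neighbor prediction does for a single point (scikit-learn's implementation is more efficient, but the logic is the same):


In [ ]:
# classify one test point by hand: find the closest training sample
# (Euclidean distance) and copy its label
new_point = X_test[0]
distances = np.sqrt(np.sum((X_train - new_point) ** 2, axis=1))
closest = np.argmin(distances)
print("predicted label:", y_train[closest])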

The interface is exactly the same as for LogisticRegression above.


In [ ]:
from sklearn.neighbors import KNeighborsClassifier

This time we set a parameter of the KNeighborsClassifier to tell it we only want to look at one nearest neighbor:


In [ ]:
knn = KNeighborsClassifier(n_neighbors=1)

We fit the model with our training data:


In [ ]:
knn.fit(X_train, y_train)

In [ ]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=40)
plt.xlabel("first feature")
plt.ylabel("second feature")
plot_2d_separator(knn, X)

Exercise

Apply the KNeighborsClassifier to the iris dataset. Play with different values of n_neighbors and observe how the training and test scores change.
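
If you want a starting point, here is one possible way to structure the exercise (a sketch, not the only solution):


In [ ]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

for n_neighbors in [1, 3, 5, 10, 30]:
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    print(n_neighbors, "train:", knn.score(X_train, y_train), "test:", knn.score(X_test, y_test))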